Hierarchical Clustering#
Hierarchical clustering is a method of cluster analysis that seeks to build a hierarchy of clusters (not just a partition of the sample into disjoint sets, but a system of nested partitions).
There are two strategies for hierarchical clustering:
Divisive, or top-down, algorithms split the sample into smaller and smaller clusters.
Agglomerative, or bottom-up, algorithms combine objects into larger and larger clusters.
The agglomerative approach is the more popular and effective of the two. The algorithm itself is quite simple and consists of the following steps:
Initialization: Create as many clusters as there are objects, each object in its own separate cluster.
Repeat: Perform the following steps until the stopping criterion is met.
Compute distances: Calculate the pairwise distance or similarity between each pair of clusters.
Merging: Combine the two closest clusters into a new cluster.
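The steps above can be sketched directly in a few lines of NumPy. This is a deliberately naive illustration with single-linkage distances and an O(n³)-per-merge loop, not an efficient implementation:

```python
import numpy as np

def naive_agglomerative(X, n_clusters):
    """Naive single-linkage agglomerative clustering (illustrative sketch)."""
    # Initialization: each point starts in its own cluster.
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > n_clusters:
        # Compute distances: find the pair of clusters whose closest
        # members are nearest to each other (single linkage).
        best = (0, 1, np.inf)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if d < best[2]:
                    best = (a, b, d)
        a, b, _ = best
        # Merging: combine the two closest clusters into one.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
print(naive_agglomerative(X, 2))  # → [[0, 1], [2, 3]]
```

Real implementations (e.g. SciPy's) avoid the repeated distance recomputation by maintaining a distance matrix, as described next.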
Distance Matrix and its calculation#
The main data structure used by hierarchical clustering algorithms, which allows them to keep track of distances between clusters, is the distance matrix.
Formally, if the dataset has n data points \(\{x_1, \ldots, x_n\}\), a distance matrix \(D\) is an \(n \times n\) matrix, where \(D_{i j}\) is the distance between points \(x_{i}\) and \(x_{j}\). By definition, the distance matrix is symmetric and its diagonal entries are all zeros.
Usually, the distance between two data points \(x, y \in\mathbb R^d\) is calculated with metrics such as:
Euclidean distance: \(\hspace{2mm} \rho(x, y)=\sqrt{\sum\limits_{i=1}^d(x_i - y_i)^2}\)
Squared Euclidean distance: \(\hspace{2mm} \rho(x, y)=\sum\limits_{i=1}^d(x_i - y_i)^2\)
Manhattan Distance: \(\hspace{2mm} \rho(x, y)=\sum\limits_{i=1}^d |x_i - y_i|\)
Cosine Distance: \(\hspace{2mm} \rho(x, y)=1-\frac{\sum\limits_{i=1}^dx_i y_i}{\sqrt{\sum\limits_{i=1}^dx_i^2}\sqrt{\sum\limits_{i=1}^dy_i^2}}\)
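For illustration, SciPy's `pdist` and `squareform` build such a matrix for any of these metrics (a small sketch; the points are made up):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])

# pdist returns the condensed upper triangle; squareform expands it
# into the full symmetric n x n matrix with zeros on the diagonal.
D_euclid = squareform(pdist(X, metric="euclidean"))
D_city = squareform(pdist(X, metric="cityblock"))  # Manhattan distance
D_cos = squareform(pdist(X[1:], metric="cosine"))  # skip the zero vector

print(D_euclid)  # D_euclid[0, 1] == 5.0, D_euclid[0, 2] == 10.0
```

Since [3, 4] and [6, 8] are collinear, their cosine distance is 0 even though their Euclidean distance is 5, which is why the choice of metric matters.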
Linkage and Distance#
At first, each object is considered as one cluster. For single element clusters, the distance function is naturally defined:
\(R(\{x\},\{y\})=\rho (x, y)\)
The distance between objects \(\rho\) can be specified by any metric listed above. Then the merging process starts. At each iteration, the pair of closest clusters \(U\) and \(V\) is replaced by a new cluster \(W = U \cup V\). There are several methods of identifying the closest clusters:
Single linkage aka Nearest Point Algorithm: The distance between clusters is defined by the distance between their closest members.
Complete linkage aka Farthest Point Algorithm: The distance between clusters is defined by the distance between their furthest members.
Unweighted Pair Group Method with Arithmetic mean (UPGMA): Also called average linkage; the distance between two clusters is the average of the distances between all pairs of their members.
Unweighted Pair Group Method with Centroid average (UPGMC): A point defined by the mean of all points (centroid) is calculated for each cluster and the distance between clusters is the distance between their respective centroids.
Ward’s method: It is based on the increase in the sum of squared errors (ESS) that results from merging two clusters. It aims to minimize the variance within the newly formed cluster. The idea is to choose the pair of clusters for which the merging leads to the smallest increase in the total within-cluster sum of squares.
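As a quick sanity check, all of the linkage methods above are exposed by SciPy's `linkage` function. On clearly separated blobs (synthetic data, chosen purely for illustration) every method recovers the same two groups:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two tight blobs of 20 points each, far apart.
X = np.vstack([rng.normal(0, 0.3, size=(20, 2)),
               rng.normal(5, 0.3, size=(20, 2))])

for method in ["single", "complete", "average", "centroid", "ward"]:
    Z = linkage(X, method=method)               # (n-1) x 4 merge history
    labels = fcluster(Z, t=2, criterion="maxclust")
    # On well-separated blobs, each half of X gets a single label.
    print(method, np.unique(labels[:20]), np.unique(labels[20:]))
```

The methods only start to disagree on harder shapes (elongated, noisy, or overlapping clusters), which is what the pros-and-cons table below summarizes.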
At each step, it is useful to be able to quickly compute the distance from the newly formed cluster \(W = U \cup V\) to any other cluster \(S\), using distances already known from previous steps. This is accomplished with the update equation proposed by Lance and Williams in 1967:
\(R(W, S) = \alpha_U \, R(U, S) + \alpha_V \, R(V, S) + \beta \, R(U, V) + \gamma \, |R(U, S) - R(V, S)|\)
Note
For each of the distance functions above, the Lance-Williams equation holds for a particular combination of parameters:
Single linkage: \(\alpha_{U}=0.5, \alpha_{V}=0.5, \beta=0, \gamma=-0.5\)
Complete linkage: \(\alpha_{U}=0.5, \alpha_{V}=0.5, \beta=0, \gamma=0.5\)
UPGMA: \(\alpha_{U}=\dfrac{|U|}{|W|}, \alpha_{V}=\dfrac{|V|}{|W|}, \beta=0, \gamma=0\)
UPGMC: \(\alpha_{U}=\dfrac{|U|}{|W|}, \alpha_{V}=\dfrac{|V|}{|W|}, \beta=-\alpha_{U}\cdot \alpha_{V}, \gamma=0\)
Ward’s method: \(\alpha_{U}=\dfrac{|S|+|U|}{|S|+|W|}, \alpha_{V}=\dfrac{|S|+|V|}{|S|+|W|}, \beta=\dfrac{-|S|}{|S|+|W|}, \gamma=0\)
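The single-linkage coefficients can be verified numerically: since \(\min(a,b)=\frac{a+b}{2}-\frac{|a-b|}{2}\), the update reproduces the direct cluster-to-cluster minimum. A small sketch on randomly generated clusters:

```python
import numpy as np

rng = np.random.default_rng(1)
U = rng.normal(0, 1, size=(4, 2))   # cluster U
V = rng.normal(3, 1, size=(5, 2))   # cluster V
S = rng.normal(6, 1, size=(3, 2))   # cluster S

def single_link(A, B):
    """R(A, B): minimum pairwise Euclidean distance between two clusters."""
    return min(np.linalg.norm(a - b) for a in A for b in B)

R_US, R_VS = single_link(U, S), single_link(V, S)
# Lance-Williams update with single-linkage coefficients:
# alpha_U = alpha_V = 0.5, beta = 0, gamma = -0.5
R_WS = 0.5 * R_US + 0.5 * R_VS - 0.5 * abs(R_US - R_VS)

W = np.vstack([U, V])
assert np.isclose(R_WS, single_link(W, S))  # matches the direct computation
print(round(R_WS, 4))
```

The same check can be repeated for the other parameter sets; the practical payoff is that each merge step only needs the distances stored from previous steps, never the raw points.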
Pros and Cons#
Each method has its pros and cons, summarized in the table below:
| Method | Pros | Cons |
|---|---|---|
| Single linkage | Can capture clusters of non-elliptical shapes. | Sensitive to noise and outliers. |
| Complete linkage | Less susceptible to noise and outliers. | Can merge rather dissimilar groups at an early stage. |
| UPGMA | Provides a balance between single and complete linkage. | Biased towards globular clusters. |
| UPGMC | Quite intuitive. | Less suitable for capturing non-elliptical clusters. |
| Ward's method | Tends to produce compact and equally-sized clusters (similar to k-means). | Assumes that the clusters have approximately the same size. |
There is no definitive answer to the question of which linkage method is best: each of the distances listed above has its drawbacks, and none suits every task. In practice, however, Ward's method tends to perform best in experimental comparisons on representative sets of model problems, recovering the intuitively correct clustering more often than the other methods.
Dendrogram#
Each iteration of the algorithm corresponds to a pair of clusters merged at that iteration, together with the distance between them at the moment of merging. Merge distances only increase from one iteration to the next, so the result can be visualized as a cluster tree called a dendrogram.
On the diagram above, the synthetic data is shown on the left and the corresponding dendrogram on the right. On the dendrogram, sample objects are arranged along the horizontal axis and distances between clusters are plotted vertically; cluster merges correspond to horizontal lines.
We can set a linkage distance threshold at or above which clusters will not be merged; varying this threshold changes the resulting set of clusters.
Number of clusters#
To determine the number of clusters, find the step \(t\) for which the interval \(|R_{t+1}-R_{t}|\) between consecutive merge distances is largest. The final clusters are those obtained at step \(t\); for a sample of \(m\) objects, the number of clusters is then \(m-t+1\). However, when the number of clusters is not known in advance and the sample is not too large, it can be useful to inspect the full dendrogram.
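This rule can be sketched with SciPy: the merge distances sit in the third column of the linkage matrix, and the widest gap between consecutive merges suggests where to cut (synthetic three-blob data, chosen purely for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(2)
# Three tight, well-separated blobs along the diagonal.
X = np.vstack([rng.normal(c, 0.2, size=(15, 2)) for c in (0, 3, 6)])

Z = linkage(X, method="ward")
# Z[:, 2] holds merge distances in increasing order; the widest gap
# between consecutive merges marks a natural cutting height.
gaps = np.diff(Z[:, 2])
i = int(np.argmax(gaps))
t = (Z[i, 2] + Z[i + 1, 2]) / 2

labels = fcluster(Z, t=t, criterion="distance")
print(np.unique(labels).size)  # → 3
```

Cutting just below the largest jump recovers the three planted blobs, which is exactly the \(|R_{t+1}-R_{t}|\) heuristic described above.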
Example: Iris Dataset#
About Dataset: The Iris flower dataset is a classic dataset in the field of machine learning and statistical analysis. It consists of 150 observations of iris flowers, including the sepal and petal length and width for each flower, as well as the species of the flower.
About Features:
| Name | Description |
|---|---|
| sepal_length | Sepal length, in centimeters, used as input. |
| sepal_width | Sepal width, in centimeters, used as input. |
| petal_length | Petal length, in centimeters, used as input. |
| petal_width | Petal width, in centimeters, used as input. |
| class | Iris Setosa, Versicolor, or Virginica, used as the target. |
import plotly.express as px
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

df_iris = px.data.iris()

# Single linkage: cluster distance is defined by the closest pair of members
dist_sin = linkage(df_iris.loc[:, ["sepal_length", "sepal_width", "petal_length", "petal_width"]], method="single")
plt.figure(figsize=(18, 6))
dendrogram(dist_sin, leaf_rotation=90)
plt.xlabel('Index')
plt.ylabel('Distance')
plt.suptitle("Dendrogram, Single Linkage", fontsize=18)
plt.show()
# Complete linkage: cluster distance is defined by the furthest pair of members
dist_comp = linkage(df_iris.loc[:, ["sepal_length", "sepal_width", "petal_length", "petal_width"]], method="complete")
plt.figure(figsize=(18, 6))
dendrogram(dist_comp, leaf_rotation=90)
plt.xlabel('Index')
plt.ylabel('Distance')
plt.suptitle("Dendrogram, Complete Linkage", fontsize=18)
plt.show()
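To turn either tree into flat labels, the dendrogram can be cut with `fcluster`. A sketch, using scikit-learn's bundled copy of the iris data rather than plotly's (an assumption; the four measurements are the same):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target        # 150 samples, 4 features, 3 species

Z = linkage(X, method="complete")
labels = fcluster(Z, t=3, criterion="maxclust")  # cut into 3 flat clusters

# Cross-tabulate flat clusters against the true species: setosa
# separates cleanly, while versicolor and virginica overlap somewhat.
for k in np.unique(labels):
    print(f"cluster {k}:", np.bincount(y[labels == k], minlength=3))
```

This is a useful pattern in general: build the linkage matrix once, then cut it at different heights or cluster counts without re-running the algorithm.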
Example: Wine Dataset#
About Dataset: These data are the results of a chemical analysis of wines grown in the same region of Italy but derived from three different grape varieties. The analysis determined the amounts of 13 constituents found in each of the three types of wine.
About Features:
| Name | Description |
|---|---|
| Alcohol | Percentage of alcohol in wine. This parameter measures the quantitative ethanol content of the wine and affects its strength. |
| Malic Acid | The amount of malic acid in wine. Malic acid gives wine freshness and brightness. |
| Ash | The amount of minerals (ash) in wine after the water has evaporated and the residues have been burned. It reflects the minerality of the wine. |
| Alcalinity of Ash | Alkalinity of ash in wine. Alkalinity measures the pH level of a wine and affects its flavor profile. |
| Magnesium | Amount of magnesium in wine. Magnesium is one of the trace elements that can affect the taste and aroma of wine. |
| Total Phenols | The total amount of phenolic compounds in wine. Phenols are antioxidants and can affect the taste and color of wine. |
| Flavanoids | The amount of flavonoids in wine. Flavonoids are also phenolic compounds and can contribute to the flavor and color of wine; they also have antioxidant properties. |
| Nonflavanoid Phenols | Amount of non-flavonoid phenolic compounds in wine. |
| Proanthocyanins | The amount of proanthocyanidins in wine. Proanthocyanidins also belong to the group of phenolic compounds. |
| Color Intensity | The color intensity of a wine, measured as the absorption of light at a specific wavelength. This parameter is related to the depth of color of the wine. |
| Hue | The shade of a wine, measured on a color scale. This value can range from orange to purple and is related to the color subtlety of the wine. |
| OD280/OD315 of Diluted Wines | The optical density of wine at a specific wavelength. This parameter may be related to the wine's content of anthocyanins (pigments that give wine its red color). |
| Proline | The amount of the amino acid proline in wine. Proline can affect the texture and structure of wine. |
#Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from sklearn.preprocessing import StandardScaler
from scipy.cluster.hierarchy import linkage, dendrogram
import plotly.figure_factory as ff
#Importing dataset
df_wine = pd.read_csv("assets/wine-clustering.csv")
df_wine.head()
| | Alcohol | Malic_Acid | Ash | Ash_Alcanity | Magnesium | Total_Phenols | Flavanoids | Nonflavanoid_Phenols | Proanthocyanins | Color_Intensity | Hue | OD280 | Proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.23 | 1.71 | 2.43 | 15.6 | 127 | 2.80 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065 |
| 1 | 13.20 | 1.78 | 2.14 | 11.2 | 100 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.40 | 1050 |
| 2 | 13.16 | 2.36 | 2.67 | 18.6 | 101 | 2.80 | 3.24 | 0.30 | 2.81 | 5.68 | 1.03 | 3.17 | 1185 |
| 3 | 14.37 | 1.95 | 2.50 | 16.8 | 113 | 3.85 | 3.49 | 0.24 | 2.18 | 7.80 | 0.86 | 3.45 | 1480 |
| 4 | 13.24 | 2.59 | 2.87 | 21.0 | 118 | 2.80 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735 |
Let's look at the correlation matrix of the dataset.
The correlation matrix shows the degree of linear relationship between each pair of features (that is, how close one feature is to being a linear function of another).
corr = df_wine.corr()
fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(corr, annot=True, fmt='.2f', ax=ax)
plt.title('Correlation matrix for Wine dataset', fontsize = 18, fontweight = 'bold')
plt.show()
The highest correlation coefficient is 0.86: as Total_Phenols increases, Flavanoids increases as well.
There is also a negative correlation between some features (that is, a decrease in one feature entails an increase in another and vice versa).
Overall, the correlation between features is mostly in the range of -0.2 to 0.3.
Since there is no strong multicollinearity in this dataset, we will not remove any variables and will work with the full feature set.
Standardization of features#
scaler = StandardScaler()
df_2 = scaler.fit_transform(df_wine)
dist_comp = linkage(df_2, method="ward")
plt.figure(figsize=(18,6))
dendrogram(dist_comp, leaf_rotation=90)
plt.xlabel('Index')
plt.ylabel('Distance')
plt.suptitle("Dendrogram using Ward Method", fontsize=18)
plt.show()
Ward's method seeks to minimize the increase in total within-cluster variance after each merge, so it tends to form clusters with relatively uniform internal dispersion. As this dendrogram shows, the dataset is best divided into 3 clusters (that is, 3 classes). We will fit a model with the following hyperparameters:
n_clusters = 3, linkage = ‘ward’
And let’s make a plot to see our clusters:
import hvplot.pandas
import panel as pn
from sklearn.cluster import AgglomerativeClustering
hc = AgglomerativeClustering(n_clusters = 3, linkage ='ward')
y_hc = hc.fit_predict(df_wine)
df_wine['cluster'] = y_hc
x = pn.widgets.Select(name='x', value='Proline',
options=['Alcohol', 'Malic_Acid','Ash','Ash_Alcanity',
'Magnesium','Total_Phenols', 'Flavanoids',
'Nonflavanoid_Phenols', 'Proanthocyanins','Color_Intensity','Hue',
'OD280', 'Proline'])
y = pn.widgets.Select(name='y', value='Color_Intensity',
options=['Alcohol', 'Malic_Acid','Ash','Ash_Alcanity',
'Magnesium','Total_Phenols', 'Flavanoids',
'Nonflavanoid_Phenols', 'Proanthocyanins','Color_Intensity','Hue',
'OD280', 'Proline'])
kind = pn.widgets.Select(name='kind', value='scatter',
options=['scatter'])
by_cluster = pn.widgets.Checkbox(name='cluster', value = True)
color = pn.widgets.ColorPicker(value='#0000ff')
@pn.depends(by_cluster, color)
def by_cluster_name(by_cluster, color):
return 'cluster' if by_cluster else color
plot = df_wine.hvplot(x=x, y=y, kind=kind, c=by_cluster_name, colorbar=True, width=600, title = 'Interactive plot to see clusters variation based on features')
pn.Row(pn.WidgetBox(x, y, kind, color, by_cluster), plot)
Feel free to experiment and see how the clusters vary across the features of the dataset.
Performance evaluation indices#
Silhouette Score
Silhouette Score is a widely used metric to evaluate the quality of clustering results. It measures how similar a data point is to its own cluster compared to other clusters. The score ranges from -1 to 1, with a higher value indicating better clustering performance.
Calinski-Harabasz Index
The Calinski-Harabasz index, also known as the Variance Ratio Criterion, is another metric for evaluating clustering performance. It measures the ratio of between-cluster variance to within-cluster variance. A higher Calinski-Harabasz value indicates better clustering performance, with higher separation between clusters and less variance within them.
from sklearn.metrics import silhouette_score
from sklearn.metrics import calinski_harabasz_score

# Evaluate on the feature columns only: the 'cluster' column added
# above must not take part in the distance computations.
X_feat = df_wine.drop(columns='cluster')
labels = hc.fit_predict(X_feat)
silhouette = silhouette_score(X_feat, labels)
chi = calinski_harabasz_score(X_feat, labels)
print('Silhouette Score', round(silhouette, 3))
print('Calinski-Harabasz Index', round(chi, 3))
Silhouette Score 0.564
Calinski-Harabasz Index 552.856
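Both indices can also guide the choice of the number of clusters. A sketch, assuming scikit-learn's bundled copy of the wine data (the same 13 features as the CSV above) and standardized inputs, so the absolute scores differ from the unscaled values printed above:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_wine
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Standardize so that no single feature (e.g. Proline) dominates distances.
X = StandardScaler().fit_transform(load_wine().data)

scores = {}
for k in range(2, 6):
    labels = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(k, round(scores[k], 3))
```

Picking the `n_clusters` with the highest score gives a data-driven counterpart to reading the number of clusters off the dendrogram.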